Abstract

Several researchers have experimentally shown that substantial improvements can be obtained
in difficult pattern recognition problems by combining or integrating the outputs of multiple
classifiers. This chapter provides an analytical framework to quantify the improvements in
classification results due to combining. The results apply to both linear combiners and order
statistics combiners. We first show that, to a first order approximation, the error rate obtained
over and above the Bayes error rate is directly proportional to the variance of the actual decision
boundaries around the Bayes optimum boundary. Combining classifiers in output space reduces
this variance, and hence reduces the "added" error. If N unbiased classifiers are combined by
simple averaging, the added error rate can be reduced by a factor of N if the individual errors in
approximating the decision boundaries are uncorrelated. Expressions are then derived for linear
combiners which are biased or correlated, and the effect of output correlations on ensemble
performance is quantified. For order statistics based non-linear combiners, we derive expressions
that indicate how much the median, the maximum and in general the ith order statistic can improve
classifier performance. The analysis presented here facilitates the understanding of the
relationships among error rates, classifier boundary distributions, and combining in output space.
Experimental results on several public domain data sets are provided to illustrate the benefits of
combining and to support the analytical results.



1 Introduction 

Training a parametric classifier involves the use of a training set of data with known labeling
to estimate or "learn" the parameters of the chosen model. A test set, consisting of patterns
not previously seen by the classifier, is then used to determine the classification performance.
This ability to meaningfully respond to novel patterns, or generalize, is an important aspect of
a classifier system and, in essence, the true gauge of performance. Given infinite training
data, consistent classifiers approximate the Bayesian decision boundaries to arbitrary precision,
therefore providing similar generalizations. However, often only a limited portion of the
pattern space is available or observable. Given a finite and noisy data set, different classifiers
typically provide different generalizations by realizing different decision boundaries.
For example, when classification is performed using a multilayered, feed-forward artificial neural
network, different weight initializations, or different architectures (number of hidden units,
hidden layers, node activation functions etc.) result in differences in performance. It is therefore
beneficial to train an ensemble of classifiers when approaching a classification problem to ensure
that a good model/parameter set is found.

*Appears in Combining Artificial Neural Networks, Ed. Amanda Sharkey, pp 127-162, Springer Verlag, 1999.

Techniques such as cross-validation also lead to multiple trained classifiers. Selecting the 
"best" classifier is not necessarily the ideal choice, since potentially valuable information may 
be wasted by discarding the results of less-successful classifiers. This observation motivates the 
concept of "combining" wherein the outputs of all the available classifiers are pooled before a 
decision is made. This approach is particularly useful for difficult problems, such as those that 
involve a large amount of noise, limited number of training data, or unusually high dimensional 
patterns. The concept of combining appeared in the neural network literature as early as 1965
and has subsequently been studied in several forms, including stacking, boosting, and bagging.
Combining has also been studied in other fields such as econometrics, under the name
"forecast combining", or machine learning where it is called "evidence combination".
The overall architecture of the combiner form studied in this article is shown in Figure 1.
The output of an individual classifier using a single feature set is given by $f^{ind}$.
Multiple classifiers, possibly trained on different feature sets, provide the combined output
$f^{comb}$.

Currently, the most popular way of combining multiple classifiers is via simple averaging of
the corresponding output values. Weighted averaging has also been proposed, along with
different methods of computing the proper classifier weights. Such linear combining techniques
have been mathematically analyzed for both regression and classification problems. Order
statistics combiners that selectively pick a classifier on a per sample basis have also been
introduced. Other non-linear methods, such as rank-based combiners or voting schemes, have
been investigated as well. Methods for combining beliefs in the Dempster-Shafer sense are also
available. Combiners have been successfully applied to a multitude of real world problems.

Combining techniques such as majority voting can generally be applied to any type of classifier, 
while others rely on specific outputs, or specific interpretations of the output. For example, the 
confidence factors method found in machine learning literature relies on the interpretation of the 
outputs as the belief that a pattern belongs to a given class. The rationale for averaging, on
the other hand, is based on the result that the outputs of parametric classifiers that are trained to
minimize a cross-entropy or mean square error (MSE) function, given "one-of-L" desired output
patterns, approximate the a posteriori probability densities of the corresponding class.
In particular, the MSE is shown to be equivalent to:

$$MSE = K_1 + \sum_i \int_x D_i(x) \left[ p(C_i|x) - f_i(x) \right]^2 dx$$

where $K_1$ and $D_i(x)$ depend on the class distributions only, $f_i(x)$ is the output of the node
representing class i given an input x, $p(C_i|x)$ denotes the posterior probability and the summation
is over all classes. Thus minimizing the MSE is equivalent to a weighted least squares fit of
the network outputs to the corresponding posterior probabilities.

In this article we first analytically study the effect of linear and order statistics combining in 
output space with a focus on the relationship between decision boundary distributions and error 
rates. Our objective is to provide an analysis that:

• encapsulates the most commonly used combining strategy, averaging in output space;

• is broad enough in scope to cover non-linear combiners; and

• relates the location of the decision boundary to the classifier error.

Figure 1: Combining Strategy. The solid lines leading to $f^{ind}$ represent the decision of a specific
classifier, while the dashed lines lead to $f^{comb}$, the output of the combiner.

The rest of this article is organized as follows. Section 2 introduces the overall framework for
estimating error rates and the effects of combining. In Section 3 we analyze linear combiners,
and derive expressions for the error rates for both biased and unbiased classifiers. In Section 4,
we examine order statistics combiners, and analyze the resulting classifier boundaries and error
regions. In Section 5 we study linear combiners that make correlated errors, derive their error
reduction rates, and discuss how to use this information to build better combiners. In Section 6,
we present experimental results based on real world problems, and we conclude with a discussion
of the implications of the work presented in this article.



2 Class Boundary Analysis and Error Regions 

Consider a single classifier whose outputs are expected to approximate the corresponding a 
posteriori class probabilities if it is reasonably well trained. The decision boundaries obtained by 
such a classifier are thus expected to be close to Bayesian decision boundaries. Moreover, these 
boundaries will tend to occur in regions where the number of training samples belonging to the 
two most locally dominant classes (say, classes i and j) are comparable. 

We will focus our analysis on network performance around the decision boundaries. Consider
the boundary between classes i and j for a single-dimensional input (the extension to
multi-dimensional inputs is discussed elsewhere). First, let us express the output response of
the ith unit of a one-of-L classifier network to a given input x as¹:

$$f_i(x) = p_i(x) + \epsilon_i(x), \tag{1}$$

where $p_i(x)$ is the a posteriori probability distribution of the ith class given input x, and
$\epsilon_i(x)$ is the error associated with the ith output².




Figure 2: Error regions associated with approximating the a posteriori probabilities. Lightly
shaded region represents the Bayes error, while the darkly shaded area represents the additional
error due to classifier f.

For the Bayes optimum decision, a vector x is assigned to class i if $p_i(x) > p_k(x)$, $\forall k \neq i$.
Therefore, the Bayes optimum boundary is the locus of all points $x^*$ such that $p_i(x^*) = p_j(x^*)$
where $p_j(x^*) = \max_{k \neq i} p_k(x^*)$. Since our classifier provides $f_i(\cdot)$ instead of $p_i(\cdot)$,
the decision boundary obtained, $x_b$, may vary from the optimum boundary (see Figure 2). Let b
denote the amount by which the boundary of the classifier differs from the optimum boundary
($b = x_b - x^*$). We have:

$$f_i(x^* + b) = f_j(x^* + b),$$

by definition of the boundary. This implies:

$$p_i(x^* + b) + \epsilon_i(x_b) = p_j(x^* + b) + \epsilon_j(x_b). \tag{2}$$

Within a suitably chosen region about the optimum boundary, the a posteriori probability
of the correct class monotonically increases relative to the others as we move away from the
boundary. This suggests a linear approximation of $p_k(x)$ around $x^*$:

$$p_k(x^* + b) \simeq p_k(x^*) + b\, p_k'(x^*), \quad \forall k, \tag{3}$$

where $p_k'(\cdot)$ denotes the derivative of $p_k(\cdot)$. With this substitution, Equation 2 becomes:

$$p_i(x^*) + b\, p_i'(x^*) + \epsilon_i(x_b) = p_j(x^*) + b\, p_j'(x^*) + \epsilon_j(x_b). \tag{4}$$

Now, since $p_i(x^*) = p_j(x^*)$, we get:

$$b \left( p_j'(x^*) - p_i'(x^*) \right) = \epsilon_i(x_b) - \epsilon_j(x_b).$$

Finally we obtain:

$$b = \frac{\epsilon_i(x_b) - \epsilon_j(x_b)}{s}, \tag{5}$$

where:

$$s = p_j'(x^*) - p_i'(x^*). \tag{6}$$

¹ If two or more classifiers need to be distinguished, a superscript is added to $f_i(x)$ and
$\epsilon_i(x)$ to indicate the classifier number.

² Here, $p_i(x)$ is used for simplicity to denote $p(C_i|x)$.

Let the error $\epsilon_i(x_b)$ be broken into a bias and noise term ($\epsilon_i(x_b) = \beta_i + \eta_i(x_b)$).
Note that the terms "bias" and "noise" are only analogies, since the error is due to the classifier
as well as the data. For the time being, the bias is assumed to be zero (i.e. $\beta_k = 0 \;\forall k$).
The case with nonzero bias will be discussed at the end of this section. Let $\sigma_{\eta_k}^2$ denote the
variances of $\eta_k(x)$, which are taken to be i.i.d. variables³. Then, the variance of the zero-mean
variable b is given by (using Equation 5):

$$\sigma_b^2 = \frac{2 \sigma_{\eta_k}^2}{s^2}. \tag{7}$$

Figure 2 shows the a posteriori probabilities obtained by a non-ideal classifier, and the
associated added error region. The lightly shaded area provides the Bayesian error region. The
darkly shaded area is the added error region associated with selecting a decision boundary that
is offset by b, since patterns corresponding to the darkly shaded region are erroneously assigned
to class i by the classifier, although ideally they should be assigned to class j.

³ Each output of each network does approximate a smooth function, and therefore the noise for two
nearby patterns on the same class (i.e. $\eta_k(x)$ and $\eta_k(x + \Delta x)$) is correlated. The
independence assumption applies to inter-class noise (i.e. $\eta_i(x)$ and $\eta_j(x)$), not intra-class noise.

The added error region, denoted by A(b), is given by:

$$A(b) = \int_{x^*}^{x^*+b} \left( p_j(x) - p_i(x) \right) dx. \tag{8}$$

Based on this area, the expected added error, $E_{add}$, is given by:

$$E_{add} = \int_{-\infty}^{\infty} A(b) f_b(b)\, db, \tag{9}$$

where $f_b$ is the density function for b. More explicitly, the expected added error is:

$$E_{add} = \int_{-\infty}^{\infty} \int_{x^*}^{x^*+b} \left( p_j(x) - p_i(x) \right) f_b(b)\, dx\, db.$$

One can compute A(b) directly by using the approximation in Equation 3 and solving Equation 8.
The accuracy of this approximation depends on the proximity of the boundary to the ideal
boundary. However, since in general the boundary density decreases rapidly with increasing
distance from the ideal boundary, the approximation is reasonable at least for the most likely
(i.e. small) values of b. This leads to:

$$A(b) = \int_{x^*}^{x^*+b} \left( \left( p_j(x^*) + (x - x^*)\, p_j'(x^*) \right) - \left( p_i(x^*) + (x - x^*)\, p_i'(x^*) \right) \right) dx,$$

or:

$$A(b) = \frac{1}{2} b^2 s, \tag{10}$$

where s is given by Equation 6.
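
The step from the integral to Equation 10 can be spelled out as follows. This is a short worked derivation that uses only $p_i(x^*) = p_j(x^*)$ and the definition of s in Equation 6; the substitution variable t is introduced here purely for illustration.

```latex
\begin{align*}
A(b) &= \int_{x^*}^{x^*+b} \Big( p_j(x^*) - p_i(x^*)
        + (x - x^*)\big(p_j'(x^*) - p_i'(x^*)\big) \Big)\, dx \\
     &= s \int_{0}^{b} t \, dt
        \qquad \big(t = x - x^*,\; p_j(x^*) = p_i(x^*)\big) \\
     &= \frac{1}{2}\, b^2 s .
\end{align*}
```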

Equation 9 shows how the error can be obtained directly from the density function of the
boundary offset. Although obtaining the exact form of the density function for b is possible (it
is straightforward for linear combiners, but convoluted for order statistics combiners), it is not
required. Since the area given in Equation 10 is a polynomial of the second degree, we can find
its expected value using the first two moments of the distribution of b. Let us define the first
and second moments of the boundary offset:

$$M_1 = \int_{-\infty}^{\infty} x f_b(x)\, dx,$$

and:

$$M_2 = \int_{-\infty}^{\infty} x^2 f_b(x)\, dx.$$

Computing the expected error for a combiner reduces to solving:

$$E_{add} = \int_{-\infty}^{\infty} \frac{1}{2} b^2 s f_b(b)\, db,$$

in terms of $M_1$ and $M_2$, leading to:

$$E_{add} = \frac{s}{2} \int_{-\infty}^{\infty} b^2 f_b(b)\, db = \frac{s M_2}{2}. \tag{11}$$

The offset b of a single classifier without bias has $M_1 = 0$ and $M_2 = \sigma_b^2$, leading to:

$$E_{add} = \frac{s \sigma_b^2}{2}. \tag{12}$$

Of course, Equation 12 only provides the added error. The total error is the sum of the added
error and the Bayes error, which is given by:

$$E_{tot} = E_{bay} + E_{add}. \tag{13}$$
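
Equations 10 to 12 can be checked numerically. The sketch below is only an illustration: it assumes a zero-mean Gaussian offset b (the analysis itself needs just the first two moments), and the values of `s` and `sigma_b` are hypothetical.

```python
import random


def added_area(b, s):
    """Added error region A(b) = b^2 * s / 2 (Equation 10)."""
    return 0.5 * b * b * s


def expected_added_error(s, sigma_b, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E_add = E[A(b)].

    Assumption (not from the text): b is drawn from a zero-mean
    Gaussian with standard deviation sigma_b.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += added_area(rng.gauss(0.0, sigma_b), s)
    return total / n_samples


s, sigma_b = 2.0, 0.1                 # hypothetical slope difference and offset std
estimate = expected_added_error(s, sigma_b)
closed_form = s * sigma_b ** 2 / 2    # Equation 12
print(estimate, closed_form)
```

The two printed numbers should agree up to sampling noise, since the expectation of a second-degree polynomial in b depends only on the second moment.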

Now, if the classifiers are biased, we need to proceed with the assumption that
$\epsilon_k(x) = \beta_k + \eta_k(x)$ where $\beta_k \neq 0$. The boundary offset for a single classifier becomes:

$$b = \frac{\eta_i(x_b) - \eta_j(x_b)}{s} + \frac{\beta_i - \beta_j}{s}. \tag{14}$$

In this case, the variance of b is left unchanged (given by Equation 7), but the mean becomes
$\beta = \frac{\beta_i - \beta_j}{s}$. In other words, we have $M_1 = \beta$ and $\sigma_b^2 = M_2 - M_1^2$, leading to the
following added error:

$$E_{add}(\beta) = \frac{s M_2}{2} = \frac{s}{2} \left( \sigma_b^2 + \beta^2 \right). \tag{15}$$
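
As a small numerical illustration of Equation 15, consider the following worked case. The values of s, $\sigma_b^2$ and $\beta$ are hypothetical, chosen only to keep the arithmetic simple.

```latex
\begin{align*}
s = 2,\quad \sigma_b^2 = 0.01,\quad \beta = 0.1
\;\Rightarrow\;
E_{add}(\beta) &= \frac{s}{2}\left(\sigma_b^2 + \beta^2\right)
               = \frac{2}{2}\,(0.01 + 0.01) = 0.02, \\
E_{add}(0) &= \frac{s}{2}\,\sigma_b^2 = 0.01 .
\end{align*}
```

With these numbers the bias contributes as much added error as the noise, so removing the variance term entirely would at best halve the added error.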
For analyzing the error regions after combining and comparing them to the single classifier
case, one needs to determine how the first and second moments of the boundary distributions are
affected by combining. The bulk of the work in the following sections focuses on obtaining those
values.



3 Linear Combining 

3.1 Linear Combining of Unbiased Classifiers 

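Let us now turn our attention to the effects of linearly combining multiple classifiers. In what
follows, the combiner denoted by ave performs an arithmetic average in output space. If N
classifiers are available, the ith output of the ave combiner provides an approximation to $p_i(x)$
given by:

$$f_i^{ave}(x) = \frac{1}{N} \sum_{m=1}^{N} f_i^m(x), \tag{16}$$

or:

$$f_i^{ave}(x) = p_i(x) + \bar{\beta}_i + \bar{\eta}_i(x),$$

where:

$$\bar{\eta}_i(x) = \frac{1}{N} \sum_{m=1}^{N} \eta_i^m(x),$$

and

$$\bar{\beta}_i = \frac{1}{N} \sum_{m=1}^{N} \beta_i^m.$$

If the classifiers are unbiased, $\bar{\beta}_i = 0$. Moreover, if the errors of different classifiers are
i.i.d., the variance of $\bar{\eta}_i$ is given by:

$$\sigma_{\bar{\eta}_i}^2 = \frac{1}{N^2} \sum_{m=1}^{N} \sigma_{\eta_i^m}^2 = \frac{1}{N} \sigma_{\eta_i}^2. \tag{17}$$

The boundary $x^{ave}$ then has an offset $b^{ave}$, where:

$$f_i^{ave}(x^* + b^{ave}) = f_j^{ave}(x^* + b^{ave}),$$

and:

$$b^{ave} = \frac{\bar{\eta}_i(x_{b^{ave}}) - \bar{\eta}_j(x_{b^{ave}})}{s}. \tag{18}$$

The variance of $b^{ave}$, $\sigma_{b^{ave}}^2$, can be computed in a manner similar to $\sigma_b^2$, resulting in:

$$\sigma_{b^{ave}}^2 = \frac{\sigma_{\bar{\eta}_i}^2 + \sigma_{\bar{\eta}_j}^2}{s^2},$$

which, using Equation 17, leads to:

$$\sigma_{b^{ave}}^2 = \frac{\sigma_{\eta_i}^2 + \sigma_{\eta_j}^2}{N s^2} = \frac{\sigma_b^2}{N}. \tag{19}$$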

Qualitatively, this reduction in variance can be readily translated into a reduction in error 
rates, since a narrower boundary distribution means the likelihood that a boundary will be 
near the ideal one is increased. In effect, using the evidence of more than one classifier reduces 
the variance of the class boundary, thereby providing a "tighter" error-prone area. In order to 
establish the exact improvements in the classification rate, we need to compute the expected 
added error region, and explore the relationship between classifier boundary variance and error 
rates. 

To that end, let us return to the added error region analysis. For the ave classifier, the first
and second moments of the boundary offset, $b^{ave}$, are: $M_1^{ave} = 0$ and $M_2^{ave} = \sigma_{b^{ave}}^2$. Using
Equation 19, we obtain $M_2^{ave} = \sigma_b^2 / N$. The added error for the ave combiner becomes:

$$E_{add}^{ave} = \frac{s}{2} M_2^{ave} = \frac{s}{2} \sigma_{b^{ave}}^2 = \frac{s}{2N} \sigma_b^2 = \frac{E_{add}}{N}. \tag{20}$$

Equation 20 quantifies the improvements due to combining N classifiers. Under the assumptions
discussed above, combining in output space reduces added error regions by a factor of N.
Of course, the total error, which is the sum of Bayes error and the added error, will be reduced
by a smaller amount, since Bayesian error will be non-zero for problems with overlapping classes.
In fact, this result, coupled with the reduction factor obtained in Section 5.2, can be used to
provide estimates for the Bayes error rate.
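
The factor-of-N reduction in the boundary variance can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original analysis: the output noise is taken to be zero-mean Gaussian and i.i.d. across classifiers and classes, and the values of `s` and `sigma_eta` are hypothetical.

```python
import random
import statistics


def boundary_offset(noise_i, noise_j, s):
    """Offset b = (eta_i - eta_j) / s of the decision boundary (zero-bias case)."""
    return (noise_i - noise_j) / s


def offset_variance(n_classifiers, sigma_eta, s, trials=50_000, seed=1):
    """Empirical variance of the boundary offset of the averaging combiner."""
    rng = random.Random(seed)
    offsets = []
    for _ in range(trials):
        # Average the i.i.d. output noise of the N classifiers, separately per class.
        eta_i = sum(rng.gauss(0.0, sigma_eta) for _ in range(n_classifiers)) / n_classifiers
        eta_j = sum(rng.gauss(0.0, sigma_eta) for _ in range(n_classifiers)) / n_classifiers
        offsets.append(boundary_offset(eta_i, eta_j, s))
    return statistics.pvariance(offsets)


s, sigma_eta = 2.0, 0.2                      # hypothetical values
var_single = offset_variance(1, sigma_eta, s)
var_ave = offset_variance(5, sigma_eta, s)
ratio = var_single / var_ave                 # should be close to N = 5
print(var_single, var_ave, ratio)
```

Because the added error is proportional to the variance of the offset, the same ratio applies to the added error of the single classifier relative to the ave combiner.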



3.2 Linear Combining of Biased Classifiers 

In general, $\bar{\beta}_i$ is nonzero since at least one classifier is biased. In this case, the boundary
offset $b^{ave}$ becomes:

$$b^{ave} = \frac{\bar{\eta}_i(x_{b^{ave}}) - \bar{\eta}_j(x_{b^{ave}})}{s} + \frac{\bar{\beta}_i - \bar{\beta}_j}{s}. \tag{21}$$

The variance of $\bar{\eta}_i(x)$ is identical to that of the unbiased case, but the mean of $b^{ave}$ is given by
$\bar{\beta}$ where

$$\bar{\beta} = \frac{\bar{\beta}_i - \bar{\beta}_j}{s}. \tag{22}$$

The effect of combining is less clear in this case, since the average bias ($\bar{\beta}$) is not necessarily
less than each of the individual biases. Let us determine the first and second moments of $b^{ave}$.
We have $M_1^{ave} = \bar{\beta}$, and $\sigma_{b^{ave}}^2 = M_2^{ave} - (M_1^{ave})^2$, leading to:

$$E_{add}^{ave}(\bar{\beta}) = \frac{s M_2^{ave}}{2} = \frac{s}{2} \left( \sigma_{b^{ave}}^2 + \bar{\beta}^2 \right),$$

which is:

$$E_{add}^{ave}(\bar{\beta}) = \frac{s}{2} \left( \frac{\sigma_b^2}{N} + \frac{\beta^2}{z^2} \right), \tag{23}$$

where $\bar{\beta} = \beta / z$ and $z \geq 1$. Now let us limit the study to the case where $z \leq \sqrt{N}$⁴. Then:

$$E_{add}^{ave}(\bar{\beta}) \leq \frac{s}{2} \left( \frac{\sigma_b^2 + \beta^2}{z^2} \right),$$

leading to:

$$E_{add}^{ave}(\bar{\beta}) \leq \frac{1}{z^2} E_{add}(\beta). \tag{24}$$

⁴ If $z > \sqrt{N}$, then the reduction of the variance becomes the limiting factor, and the reductions
established in the previous section hold.

Equation 24 quantifies the error reduction in the presence of network bias. The improvements
are more modest than those of the previous section, since both the bias and the variance of the
noise need to be reduced. If both the variance and the bias contribute to the error, and their
contributions are of similar magnitude, the actual reduction is given by $\min(z^2, N)$. If the bias
can be kept low (e.g. by purposefully using a larger network than required), then once again N
becomes the reduction factor. These results highlight the basic strengths of combining, which
not only provides improved error rates, but is also a method of controlling the bias and variance
components of the error separately, thus providing an interesting solution to the bias/variance
problem.
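
The bound in Equation 24 can be checked directly from Equations 15 and 23. The following is a minimal sketch; the function names and all numerical values are hypothetical and serve only to illustrate the formulas.

```python
def added_error(s, var_b, beta):
    """E_add(beta) = s/2 * (sigma_b^2 + beta^2), Equation 15."""
    return 0.5 * s * (var_b + beta ** 2)


def added_error_ave(s, var_b, beta, n, z):
    """E_add^ave = s/2 * (sigma_b^2 / N + beta^2 / z^2), Equation 23."""
    return 0.5 * s * (var_b / n + (beta / z) ** 2)


s, var_b, beta, n = 2.0, 0.01, 0.1, 9      # hypothetical values
for z in (1.0, 2.0, 3.0):                  # every z satisfies 1 <= z <= sqrt(N) = 3
    single = added_error(s, var_b, beta)
    combined = added_error_ave(s, var_b, beta, n, z)
    # Equation 24: the combined added error is at most E_add(beta) / z^2.
    assert combined <= single / z ** 2 + 1e-12
    print(z, single / combined)
```

For z = 1 the bias is not reduced at all and the overall gain stays below 2 in this example, even though the variance term shrinks by a factor of 9.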

4 Order Statistics 

4.1 Introduction 

Approaches to pooling classifiers can be separated into two main categories: simple combiners, 
e.g., averaging, and computationally expensive combiners, e.g., stacking. The simple combining 
methods are best suited for problems where the individual classifiers perform the same task, and 
have comparable success. However, such combiners are susceptible to outliers and to unevenly 
performing classifiers. In the second category, "meta-learners," i.e., either sets of combining rules,
or full-fledged classifiers acting on the outputs of the individual classifiers, are constructed. This
type of combining is more general, but suffers from all the problems associated with the extra 
learning (e.g., overparameterizing, lengthy training time). 

Both these methods are in fact ill-suited for problems where most (but not all) classifiers 
perform within a well-specified range. In such cases the simplicity of averaging the classifier 
outputs is appealing, but the prospect of one poor classifier corrupting the combiner makes 
this a risky choice. Although weighted averaging of classifier outputs appears to provide some
flexibility, obtaining the optimal weights can be computationally expensive. Furthermore, the
weights are generally assigned on a per classifier, rather than per sample or per class basis. If a
classifier is accurate only in certain areas of the input space, this scheme fails to take advantage
of the variable accuracy of the classifier in question. Using a meta learner that would have
weights for each classifier on each pattern would solve this problem, but at a considerable cost.
The robust combiners presented in this section aim at bridging the gap between simplicity and
generality by allowing the flexible selection of classifiers without the associated cost of training
meta classifiers.

4.2 Background 

In this section we will briefly discuss some basic concepts and properties of order statistics. Let
X be a random variable with a probability density function $f_X(\cdot)$, and cumulative distribution
function $F_X(\cdot)$. Let $(X_1, X_2, \cdots, X_N)$ be a random sample drawn from this distribution. Now,
let us arrange them in non-decreasing order, providing:

$$X_{1:N} \leq X_{2:N} \leq \cdots \leq X_{N:N}.$$

The ith order statistic, denoted by $X_{i:N}$, is the ith value in this progression. The cumulative
distribution function for the smallest and largest order statistic can be obtained by noting that:

$$F_{X_{N:N}}(x) = P(X_{N:N} \leq x) = \prod_{i=1}^{N} P(X_i \leq x) = [F_X(x)]^N,$$

and:

$$F_{X_{1:N}}(x) = P(X_{1:N} \leq x) = 1 - P(X_{1:N} > x) = 1 - \prod_{i=1}^{N} P(X_i > x)$$
$$= 1 - \prod_{i=1}^{N} \left( 1 - P(X_i \leq x) \right) = 1 - [1 - F_X(x)]^N.$$

The corresponding probability density functions can be obtained from these equations. In general,
for the ith order statistic, the cumulative distribution function gives the probability that at least
i of the chosen X's are less than or equal to x. The probability density function of $X_{i:N}$ is then
given by:

$$f_{X_{i:N}}(x) = \frac{N!}{(i-1)!\,(N-i)!} [F_X(x)]^{i-1} [1 - F_X(x)]^{N-i} f_X(x). \tag{25}$$

This general form, however, cannot always be computed in closed form. Therefore, obtaining
the expected value of a function of x using Equation 25 is not always possible. However, the
first two moments of the density function are widely available for a variety of distributions.
These moments can be used to compute the expected values of certain specific functions, e.g.
polynomials of degree at most two.
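
When tabulated moments are not at hand, the density of the ith order statistic can be integrated numerically. The sketch below is an illustration only: it uses a Uniform(0, 1) parent distribution, chosen because the mean of its ith order statistic is known to be i/(N+1), and a simple midpoint rule; the helper names are hypothetical.

```python
import math


def order_stat_pdf(x, i, n, cdf, pdf):
    """Density of the i-th order statistic of n i.i.d. samples."""
    coeff = math.factorial(n) / (math.factorial(i - 1) * math.factorial(n - i))
    return coeff * cdf(x) ** (i - 1) * (1.0 - cdf(x)) ** (n - i) * pdf(x)


def moment(k, i, n, cdf, pdf, lo, hi, steps=20_000):
    """k-th moment of the i-th order statistic, midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        x = lo + (j + 0.5) * h
        total += x ** k * order_stat_pdf(x, i, n, cdf, pdf)
    return total * h


def uniform_cdf(x):
    return x


def uniform_pdf(x):
    return 1.0


# Maximum and median of N = 5 uniform samples.
mean_max = moment(1, 5, 5, uniform_cdf, uniform_pdf, 0.0, 1.0)   # exact value: 5/6
mean_med = moment(1, 3, 5, uniform_cdf, uniform_pdf, 0.0, 1.0)   # exact value: 1/2
print(mean_max, mean_med)
```

The same `moment` helper with k = 2 gives the second moment, which is all that the added error computation requires.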



4.3 Combining Unbiased Classifiers through OS 

Now, let us turn our attention to order statistic combiners. For a given input x, let the network
outputs of each of the N classifiers for each class i be ordered in the following manner:

$$f_i^{1:N}(x) \leq f_i^{2:N}(x) \leq \cdots \leq f_i^{N:N}(x).$$

Then, the max, med and min combiners are defined as follows:

$$f_i^{max}(x) = f_i^{N:N}(x), \tag{26}$$

$$f_i^{med}(x) = \begin{cases} \dfrac{f_i^{\frac{N}{2}:N}(x) + f_i^{\frac{N}{2}+1:N}(x)}{2} & \text{if } N \text{ is even} \\[2mm] f_i^{\frac{N+1}{2}:N}(x) & \text{if } N \text{ is odd,} \end{cases} \tag{27}$$

$$f_i^{min}(x) = f_i^{1:N}(x). \tag{28}$$

These three combiners are chosen because they represent important qualitative interpretations
of the output space. Selecting the maximum combiner is equivalent to selecting the class with
the highest posterior. Indeed, since the network outputs approximate the class a posteriori
distributions, selecting the maximum reduces to selecting the classifier with the highest confidence
in its decision. The drawback of this method, however, is that it can be compromised by a single
classifier that repeatedly provides high values. The selection of the minimum combiner follows
a similar logic, but focuses on classes that are unlikely to be correct, rather than on the correct
class. Thus, this combiner eliminates less likely classes by basing the decision on the lowest value
for a given class. This combiner suffers from the same ills as the max combiner, although it is
less dependent on a single error, since it performs a min-max operation, rather than a max-max⁵.
The median classifier, on the other hand, considers the most "typical" representation of each class.
For highly noisy data, this combiner is more desirable than either the min or max combiners since
the decision is not compromised as much by a single large error.

⁵ Recall that the pattern is ultimately assigned to the class with the highest combined output.
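
The combiners discussed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the example outputs are hypothetical, and the ave rule is included only for comparison.

```python
import statistics


def combine(outputs, rule):
    """Combine per-class outputs of several classifiers.

    outputs: one list per classifier, each holding one output per class.
    rule: "ave", "max", "med" or "min".
    """
    rules = {"ave": statistics.mean, "max": max, "med": statistics.median, "min": min}
    pick = rules[rule]
    n_classes = len(outputs[0])
    # Apply the rule across classifiers, separately for each class output.
    return [pick([clf[k] for clf in outputs]) for k in range(n_classes)]


def decide(outputs, rule):
    """Assign the pattern to the class with the highest combined output."""
    combined = combine(outputs, rule)
    return max(range(len(combined)), key=combined.__getitem__)


outputs = [
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
    [0.05, 0.90, 0.05],   # one classifier disagreeing with a large output
]
for rule in ("ave", "max", "med", "min"):
    print(rule, combine(outputs, rule), decide(outputs, rule))
```

In this example the single dissenting classifier flips the ave, max and min decisions to class 1, while the med combiner stays with class 0. `statistics.median` averages the two middle values for even N, matching the definition of the med combiner.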

The analysis of the properties of these combiners does not depend on the order statistic
chosen. Therefore we will denote all three by $f^{os}(x)$ and derive the error regions. The network
output provided by $f_i^{os}(x)$ is given by:

$$f_i^{os}(x) = p_i(x) + \epsilon_i^{os}(x). \tag{29}$$

Let us first investigate the zero-bias case ($\beta_k = 0 \;\forall k$). We get $\epsilon_k^{os}(x) = \eta_k^{os}(x) \;\forall k$, since the
variations in the kth output of the classifiers are solely due to noise. Proceeding as before, the
boundary offset $b^{os}$ is shown to be:

$$b^{os} = \frac{\eta_i^{os}(x_b) - \eta_j^{os}(x_b)}{s}. \tag{30}$$

Since the $\eta_k$'s are i.i.d., and $\eta_k^{os}$ is the same order statistic for each class, the moments will be
identical for each class. Moreover, taking the order statistic will shift the mean of both $\eta_i^{os}$ and
$\eta_j^{os}$ by the same amount, leaving the mean of the difference unaffected. Therefore, $b^{os}$ will have
zero mean, and variance:

$$\sigma_{b^{os}}^2 = \frac{2 \sigma_{\eta_k^{os}}^2}{s^2} = \frac{2 \alpha \sigma_{\eta_k}^2}{s^2} = \alpha \sigma_b^2, \tag{31}$$

where $\alpha$ is a reduction factor that depends on the order statistic and on the distribution of b.
For most distributions, $\alpha$ can be found in tabulated form. For example, Table 1 provides $\alpha$
values for all three os combiners, up to 15 classifiers, for a Gaussian distribution.

Returning to the error calculation, we have: $M_1^{os} = 0$, and $M_2^{os} = \sigma_{b^{os}}^2$, providing:

$$E_{add}^{os} = \frac{s M_2^{os}}{2} = \frac{s \sigma_{b^{os}}^2}{2} = \frac{s \alpha \sigma_b^2}{2} = \alpha E_{add}. \tag{32}$$

Equation 32 shows that the reduction in the error region is directly related to the reduction
in the variance of the boundary offset b. Since the means and variances of order statistics for
a variety of distributions are widely available in tabular form, the reductions can be readily
quantified.
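
The reduction factor can also be estimated by simulation rather than looked up. The sketch below assumes standard Gaussian noise, so that the parent variance is 1 and the estimate is simply the variance of the chosen order statistic; the function name and sample sizes are hypothetical. Odd N is used so that the median is a single order statistic.

```python
import random
import statistics


def reduction_factor(n, rule, trials=50_000, seed=0):
    """Estimate Var(order statistic) / Var(parent) for standard Gaussian noise."""
    pick = {"min": min, "max": max, "med": statistics.median}[rule]
    rng = random.Random(seed)
    values = [pick([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(trials)]
    return statistics.pvariance(values)   # the parent variance is 1


alpha_max_3 = reduction_factor(3, "max")   # Table 1 lists .560 for N = 3
alpha_med_3 = reduction_factor(3, "med")   # Table 1 lists .449 for N = 3
print(alpha_max_3, alpha_med_3)
```

The estimates carry sampling noise in the third decimal place, so they should be compared with the tabulated values only approximately.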



4.4 Combining Biased Classifiers through OS 

In this section, we analyze the error regions in the presence of bias. Let us study $b^{os}$ in detail
when multiple classifiers are combined using order statistics. First note that the bias and noise
cannot be separated, since in general $(a + b)^{os} \neq a^{os} + b^{os}$. We will therefore need to specify the
mean and variance of the result of each operation⁶. Equation 30 becomes:

$$b^{os} = \frac{(\beta_i + \eta_i(x_b))^{os} - (\beta_j + \eta_j(x_b))^{os}}{s}. \tag{33}$$

Now, $\beta_k$ has mean $\bar{\beta}_k$, given by $\frac{1}{N} \sum_{m=1}^{N} \beta_k^m$, where m denotes the different classifiers. Since
the noise is zero-mean, $\beta_k + \eta_k(x_b)$ has first moment $\bar{\beta}_k$ and variance $\sigma_{\eta_k}^2 + \sigma_{\beta_k}^2$, where
$\sigma_{\beta_k}^2 = \frac{1}{N-1} \sum_{m=1}^{N} (\beta_k^m - \bar{\beta}_k)^2$.

⁶ Since the exact distribution parameters of $b^{os}$ are not known, we use the sample mean and the sample variance.
Table 1: Reduction factors α for the min, max and med combiners.

            OS Combiners
  N     minimum/maximum     median
  1          1.00            1.00
  2          .682            .682
  3          .560            .449
  4          .492            .361
  5          .448            .287
  6          .416            .246
  7          .392            .210
  8          .373            .187
  9          .357            .166
 10          .344            .151
 11          .333            .137
 12          .327            .127
 13          .315            .117
 14          .308            .109
 15          .301            .102

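Taking a specific order statistic of this expression will modify both moments. The first moment
is given by $\bar{\beta}_k + \mu^{os}$, where $\mu^{os}$ is a shift which depends on the order statistic chosen, but not on
the class. The first moment of $b^{os}$ then, is given by:

$$\frac{(\bar{\beta}_i + \mu^{os}) - (\bar{\beta}_j + \mu^{os})}{s} = \frac{\bar{\beta}_i - \bar{\beta}_j}{s} = \bar{\beta}. \tag{34}$$

Note that the bias term represents an "average bias" since the contributions due to the order
statistic are removed. Therefore, reductions in bias cannot be obtained from a table similar to
Table 1.

Now, let us turn our attention to the variance. Since $\beta_k + \eta_k(x_b)$ has variance $\sigma_{\eta_k}^2 + \sigma_{\beta_k}^2$, it
follows that $(\beta_k + \eta_k(x_b))^{os}$ has variance $\sigma_{\eta_k^{os}}^2 = \alpha(\sigma_{\eta_k}^2 + \sigma_{\beta_k}^2)$, where $\alpha$ is the factor discussed in
Section 4.3. Since $b^{os}$ is a linear combination of error terms, its variance is given by:

$$\sigma_{b^{os}}^2 = \frac{\sigma_{\eta_i^{os}}^2 + \sigma_{\eta_j^{os}}^2}{s^2} = \frac{2 \alpha \sigma_{\eta_k}^2}{s^2} + \frac{\alpha (\sigma_{\beta_i}^2 + \sigma_{\beta_j}^2)}{s^2} \tag{35}$$

$$= \alpha (\sigma_b^2 + \sigma_\beta^2), \tag{36}$$

where $\sigma_\beta^2 = \frac{\sigma_{\beta_i}^2 + \sigma_{\beta_j}^2}{s^2}$ is the variance introduced by the biases of different classifiers. The result of
bias then manifests itself both in the mean and the variance of the boundary offset $b^{os}$.

We have now obtained the first and second moments of $b^{os}$, and can compute the added error
region. Namely, we have $M_1^{os} = \bar{\beta}$ and $\sigma_{b^{os}}^2 = M_2^{os} - (M_1^{os})^2$, leading to:

$$E_{add}^{os}(\beta) = \frac{s M_2^{os}}{2} = \frac{s}{2} \left( \sigma_{b^{os}}^2 + \bar{\beta}^2 \right) \tag{37}$$

$$= \frac{s}{2} \left( \alpha (\sigma_b^2 + \sigma_\beta^2) + \bar{\beta}^2 \right). \tag{38}$$

The reduction in the error is more difficult to assess in this case. By writing the error as:

$$E_{add}^{os}(\beta) = \alpha \left( \frac{s}{2} (\sigma_b^2 + \beta^2) \right) + \frac{s}{2} \left( \alpha \sigma_\beta^2 + \bar{\beta}^2 - \alpha \beta^2 \right),$$

we get:

$$E_{add}^{os}(\beta) = \alpha E_{add}(\beta) + \frac{s}{2} \left( \alpha \sigma_\beta^2 + \bar{\beta}^2 - \alpha \beta^2 \right). \tag{39}$$

Analyzing the error reduction in the general case requires knowledge about the bias introduced
by each classifier. However, it is possible to analyze the extreme cases. If each classifier has the
same bias for example, $\sigma_\beta^2$ is reduced to zero and $\bar{\beta} = \beta$. In this case the error reduction can be
expressed as:

$$E_{add}^{os}(\beta) = \frac{s}{2} \left( \alpha \sigma_b^2 + \beta^2 \right),$$

where only the error contribution due to the variance of b is reduced. In this case it is important
to reduce classifier bias before combining (e.g. by using an overparametrized model). If on the
other hand, the biases produce a zero mean variable, i.e. they cancel each other out, we obtain
$\bar{\beta} = 0$. In this case, the added error becomes:

$$E_{add}^{os}(\beta) = \alpha E_{add}(\beta) + \frac{s \alpha}{2} \left( \sigma_\beta^2 - \beta^2 \right),$$

and the error reduction will be significant as long as $\sigma_\beta^2 \leq \beta^2$.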
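
The two ways of writing the added error for biased order statistics combiners, and the two extreme cases, can be compared numerically. The sketch below is only an illustration: the function names and all input values are hypothetical, except $\alpha = 0.449$, which is the Table 1 entry for the median of three classifiers.

```python
def added_error(s, var_b, beta):
    """Single-classifier added error, s/2 * (sigma_b^2 + beta^2)."""
    return 0.5 * s * (var_b + beta ** 2)


def added_error_os(s, var_b, var_beta, beta_bar, alpha):
    """Direct form: s/2 * (alpha * (sigma_b^2 + sigma_beta^2) + beta_bar^2)."""
    return 0.5 * s * (alpha * (var_b + var_beta) + beta_bar ** 2)


def added_error_os_split(s, var_b, var_beta, beta, beta_bar, alpha):
    """Split form: alpha * E_add(beta) plus a correction term."""
    correction = 0.5 * s * (alpha * var_beta + beta_bar ** 2 - alpha * beta ** 2)
    return alpha * added_error(s, var_b, beta) + correction


s, var_b, beta, alpha = 2.0, 0.01, 0.1, 0.449

# General case with hypothetical bias spread and average bias.
direct = added_error_os(s, var_b, 0.004, 0.05, alpha)
split = added_error_os_split(s, var_b, 0.004, beta, 0.05, alpha)

# Extreme case 1: identical biases (sigma_beta^2 = 0, beta_bar = beta).
same_bias = added_error_os(s, var_b, 0.0, beta, alpha)

# Extreme case 2: biases cancel (beta_bar = 0).
cancelling = added_error_os(s, var_b, 0.004, 0.0, alpha)

print(direct, split, same_bias, cancelling, added_error(s, var_b, beta))
```

With identical biases only the variance part shrinks, so the gain over the single classifier is limited; with cancelling biases and a small bias spread the combined error drops well below $\alpha E_{add}(\beta)$.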

5 Correlated Classifier Combining 

5.1 Introduction 

The discussion so far focused on finding the types of combiners that improve performance. Yet, 
it is important to note that if the classifiers to be combined repeatedly provide the same (either 
erroneous or correct) classification decisions, there is little to be gained from combining, regardless 
of the chosen scheme. Therefore, the selection and training of the classifiers that will be combined 
is as critical an issue as the selection of the combining method. Indeed, classifier/data selection 
is directly tied to the amount of correlation among the various classifiers, which in turn affects 
the amount of error reduction that can be achieved. 

The tie between error correlation and classifier performance was directly or indirectly observed
by many researchers. For regression problems, Perrone and Cooper show that their combining
results are weakened if the networks are not independent. Ali and Pazzani discuss the
relationship between error correlations and error reductions in the context of decision trees.
Meir discusses the effect of independence on combiner performance, and Jacobs reports that
$N' \leq N$ independent classifiers are worth as much as N dependent classifiers. The influence
of the amount of training on ensemble performance has also been studied. For classification
problems, the effect of the correlation among the classifier errors on combiner performance was
quantified by the authors.

5.2 Combining Unbiased Correlated Classifiers 

In this section we derive the explicit relationship between the correlation among classifier errors 
and the error reduction due to combining. Let us focus on the linear combination of unbiased 
classifiers. Without the independence assumption, the variance of fji is given by: 



1 

7V2 



N N 



4 



^^co^;(r7r(2;),^l(^)) 



1=1 m=l 
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1 ^ 1 ^ 

= E ^'r (-) + E E (^■^" ' ) 

m— 1 m— 1 /^^m 

where coi;(-, •) represents the covariance. Expressing the covariances in term of the correlations 
{cov{x,y) — corr[x,y) Ux Oy), leads to: 

1 ^ 1 ^ 

4 = ]^ E '^'r(-) + ]^ E E corr(C(a;), r?l(a;))a^™(,)fT^,(^). (40) 

m— 1 m — 1 l^m 

In situations where the variance of a given output is comparable across the different classifiers, 
Equation 40 is significantly simplified by using the common variance , thus becoming: 

1 1 ^ 

4 = N^l(x) + ]^ E E corr(7?r(a:), ■ 

m—1 l^m 



Let (5i be the correlation factor among all classifiers for the ith output: 

N 

N (N -1) ^ ■ 

The variance of fji becomes: 

2 _ 1 2 A^- 1 ^ 2 

Now, let us return to the boundary a;"^'^, and its offset fo"^*^, where: 

In Section |3.1^ the variance of b°''"'^ was shown to be: 

2 _ 4+4 



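Therefore:

$$\sigma^2_{b^{ave}} = \frac{1}{s^2} \left( \frac{1}{N} \sigma^2_{\eta_i(x)} \left(1 + (N-1)\,\delta_i\right) + \frac{1}{N} \sigma^2_{\eta_j(x)} \left(1 + (N-1)\,\delta_j\right) \right),$$

which leads to:

$$\sigma^2_{b^{ave}} = \frac{1}{s^2 N} \left( \sigma^2_{\eta_i(x)} + \sigma^2_{\eta_j(x)} + (N-1) \left( \delta_i\, \sigma^2_{\eta_i(x)} + \delta_j\, \sigma^2_{\eta_j(x)} \right) \right),$$

or:

$$\sigma^2_{b^{ave}} = \frac{\sigma^2_{\eta_i(x)} + \sigma^2_{\eta_j(x)}}{N s^2} + \frac{N-1}{N s^2} \left( \delta_i\, \sigma^2_{\eta_i(x)} + \delta_j\, \sigma^2_{\eta_j(x)} \right).$$

Recalling that the noise between classes is i.i.d., so that $\sigma^2_{\eta_i(x)} = \sigma^2_{\eta_j(x)}$ and the
single-classifier boundary variance is $\sigma_b^2 = 2 \sigma^2_{\eta_j(x)} / s^2$, leads to:

$$\sigma^2_{b^{ave}} = \frac{1}{N} \sigma_b^2 + \frac{N-1}{N} \frac{2 \sigma^2_{\eta_j(x)}}{s^2} \frac{\delta_i + \delta_j}{2} = \frac{\sigma_b^2}{N} \left( 1 + (N-1) \frac{\delta_i + \delta_j}{2} \right).$$

Footnote: The errors between classifiers are correlated, not the errors between classes.

This expression only considers the errors that occur between classes $i$ and $j$. In order to extend
this expression to include all the boundaries, we introduce an overall correlation term $\delta$. Then,
the added error is computed in terms of $\delta$. The correlation among classifiers is calculated using
the following expression:

$$\delta = \sum_{i=1}^{L} P_i\, \delta_i \tag{42}$$

where $P_i$ is the prior probability of class $i$. The contribution of each class to the
overall correlation is proportional to the prior probability of that class.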
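
The two correlation quantities can be estimated from data along the following lines. This is only a sketch: it assumes that the error of each classifier on each output is available as a sequence over a common set of patterns, and the function names (`pearson`, `class_correlation`, `overall_correlation`) and the toy numbers are hypothetical, not taken from the experiments.

```python
from itertools import combinations
from math import sqrt


def pearson(u, v):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)


def class_correlation(errors_i):
    """delta_i: mean pairwise correlation among the N classifiers' errors
    on output i. errors_i[m] holds classifier m's errors over the patterns."""
    pairs = list(combinations(range(len(errors_i)), 2))
    # corr is symmetric, so averaging over unordered pairs equals the
    # 1/(N(N-1)) double sum over ordered pairs m != l.
    return sum(pearson(errors_i[m], errors_i[l]) for m, l in pairs) / len(pairs)


def overall_correlation(errors, priors):
    """delta = sum_i P_i * delta_i, the prior-weighted overall correlation.
    errors[i][m] is the error sequence of classifier m on output i."""
    return sum(p * class_correlation(e) for p, e in zip(priors, errors))


# Toy data (hypothetical): 2 classes, 3 classifiers, 4 patterns each.
errors = [
    [[0.1, -0.2, 0.3, 0.0], [0.2, -0.1, 0.25, 0.05], [0.0, -0.3, 0.2, 0.1]],
    [[0.1, 0.0, -0.1, 0.2], [-0.1, 0.0, 0.1, -0.2], [0.1, 0.0, -0.1, 0.2]],
]
delta = overall_correlation(errors, priors=[0.5, 0.5])
print(round(delta, 3))
```

In the second toy class, two classifiers make identical errors and the third makes exactly opposite ones, so the three pairwise correlations are 1, -1 and -1 and the class correlation factor is -1/3.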



[Figure: Err(ave)/Err as a function of N and δ, computed from Equation 43.]




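Let us now return to the error region analysis. With this formulation the first and second
moments of $b^{ave}$ yield: $M_1^{ave} = 0$ and $M_2^{ave} = \sigma^2_{b^{ave}}$. The derivation is identical to that of
Section 3.1 and the only change is in the relation between $\sigma_b^2$ and $\sigma^2_{b^{ave}}$. We then get:

$$E_{add}^{ave} = \frac{s}{2} M_2^{ave} = \frac{s}{2} \sigma^2_{b^{ave}} = \frac{s}{2} \sigma_b^2 \left( \frac{1 + \delta (N-1)}{N} \right) = E_{add} \left( \frac{1 + \delta (N-1)}{N} \right). \tag{43}$$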

The effect of the correlation between the errors of each classifier is readily apparent from
Equation 43. If the errors are independent, then the second part of the reduction term vanishes
and the combined error is reduced by a factor of $N$. If, on the other hand, the error of each classifier
has correlation 1, then the error of the combiner is equal to the initial errors and there is no
improvement due to combining. The figure shows how the variance reduction is affected by $N$ and
$\delta$ (using Equation 43).

In general, the correlation values lie between these two extremes, and some reduction is
achieved. It is important to understand the interaction between $N$ and $\delta$ in order to maximize the
reduction. As more and more classifiers are used (increasing $N$), it becomes increasingly difficult
to find uncorrelated classifiers. The figure can be used to determine the number of classifiers needed
for attaining satisfactory combining performance.
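
The interaction between $N$ and $\delta$ can also be tabulated directly. The sketch below evaluates the ratio $(1 + \delta (N-1)) / N$ between the combiner's added error and that of a single classifier; the function name `added_error_ratio` and the chosen grid of values are illustrative assumptions.

```python
def added_error_ratio(n, delta):
    """Ratio of combined to single-classifier added error:
    (1 + delta * (N - 1)) / N."""
    if n < 1:
        raise ValueError("need at least one classifier")
    return (1.0 + delta * (n - 1)) / n


# One row per correlation value, one column per number of classifiers.
for delta in (0.0, 0.25, 0.5, 0.75, 1.0):
    row = "  ".join(f"{added_error_ratio(n, delta):.3f}" for n in (1, 3, 5, 7, 25))
    print(f"delta={delta:.2f}: {row}")
```

For a fixed $\delta$ the ratio tends to $\delta$ as $N$ grows, so beyond a modest number of classifiers the correlation, rather than $N$, limits the achievable reduction.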



5.3 Combining Biased Correlated Classifiers 

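Let us now return to the analysis of biased classifiers. As discussed earlier, the boundary
offset of a single classifier is given by:

$$b = \frac{\eta_i(x_b) - \eta_j(x_b)}{s} + \beta \tag{44}$$

where $\beta = \frac{\beta_i - \beta_j}{s}$, leading to the following added error term:

$$E_{add}(\beta) = \frac{s}{2} \left( \sigma_b^2 + \beta^2 \right). \tag{45}$$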

Let us now focus on the effects of the ave combiner on the boundary. The combiner output
is now given by:

$$f_i^{ave}(x) = p_i(x) + \bar{\beta}_i + \bar{\eta}_i(x),$$

where $\bar{\eta}_i(x)$ and $\bar{\beta}_i$ are as defined earlier. The boundary offset $b^{ave}$ is:

$$b^{ave} = \frac{\bar{\eta}_i(x_b) - \bar{\eta}_j(x_b)}{s} + \bar{\beta} \tag{46}$$

where $\bar{\beta}$ is the combined bias term defined earlier. The variance of $b^{ave}$ is not affected by the biases, and the
derivation of Section 5.2 applies to this case as well.

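The first and second moments of $b^{ave}$, the boundary offset obtained using the ave combiner, for
biased, correlated classifiers, are given by: $M_1^{ave} = \bar{\beta}$ and $M_2^{ave} = \sigma^2_{b^{ave}} + \bar{\beta}^2$. The corresponding
added error region is:

$$E_{add}^{ave}(\bar{\beta}) = \frac{s}{2} M_2^{ave} = \frac{s}{2} \sigma^2_{b^{ave}} + \frac{s}{2} \bar{\beta}^2.$$

Using the overall correlation term obtained in the previous section, we can represent this expression
in terms of the boundary parameters of the single classifier, and the bias reduction factor $z$
introduced in Section 3.2:

$$E_{add}^{ave}(\bar{\beta}) = \frac{s}{2} \left( \sigma_b^2 \left( \frac{1 + \delta (N-1)}{N} \right) + \frac{\beta^2}{z^2} \right). \tag{47}$$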



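In order to obtain the error reduction rates, let us introduce $\tau$, the factor that will determine the
final reduction:

$$\tau^2 = \min \left( z^2, \frac{N}{1 + \delta (N-1)} \right). \tag{48}$$

Now, Equation 47 leads to:

$$E_{add}^{ave}(\bar{\beta}) \le \frac{1}{\tau^2} E_{add}(\beta). \tag{49}$$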



Equation 49 shows the error reduction for correlated, biased classifiers. As long as the biases of
individual classifiers are reduced by a larger amount than the correlated variances, the reduction
will be similar to those in Section 5.2. However, if the biases are not reduced, the improvement
gains will not be as significant. These results are conceptually identical to those obtained earlier
for biased classifiers, but vary in how the bias reduction $z$ relates to $N$. In effect, the requirements on
reducing $z$ are lower than they were previously, since in the presence of bias, the error reduction
is less than $N$. The practical implication of this observation is that, even in the presence of
bias, the correlation dependent variance reduction term (given in Equation 43) will often be the
limiting factor, and dictate the error reductions.
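
To see which term limits the reduction in a given setting, the bound can be evaluated numerically. The sketch below assumes the reduction factor has the form $\tau^2 = \min(z^2, N / (1 + \delta (N-1)))$ and that the combined added error is bounded by $E_{add}(\beta) / \tau^2$; the function names and the sample values of $z$, $N$ and $\delta$ are hypothetical.

```python
def tau_squared(z, n, delta):
    """tau^2 = min(z^2, N / (1 + delta * (N - 1))): the smaller of the
    bias reduction and the correlation-limited variance reduction."""
    return min(z * z, n / (1.0 + delta * (n - 1)))


def added_error_bound(e_add_single, z, n, delta):
    """Upper bound on the combiner's added error: E_add(beta) / tau^2."""
    return e_add_single / tau_squared(z, n, delta)


# Hypothetical settings, for illustration only.
for z, n, delta in [(3.0, 5, 0.0), (3.0, 5, 0.5), (1.2, 5, 0.5)]:
    print(f"z={z}, N={n}, delta={delta}: tau^2={tau_squared(z, n, delta):.3f}")
```

In the first two settings the variance term is the binding one (the bias is reduced more strongly than the variance), while in the third the weak bias reduction $z^2 = 1.44$ takes over.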



5.4 Discussion 

In this section we established the importance of the correlation among the errors of individual 
classifiers in a combiner system. One can exploit this relationship explicitly by reducing the 
correlation among classifiers that will be combined. Several methods have been proposed for this 
purpose and many researchers are actively exploring this area pof . 

Cross-validation, a statistical method aimed at estimating the "true" error ||2|, can
also be used to control the amount of correlation among classifiers. By only training individual
classifiers on overlapping subsets of the data, the correlation can be reduced. The various boosting
algorithms exploit the relationship between correlation and error rate by training subsequent
classifiers on training patterns that have been "selected" by earlier classifiers [|l5|, |l^, |l^, |5^
thus reducing the correlation among them. Krogh and Vedelsby discuss how cross-validation can
be used to improve ensemble performance p6[ . Bootstrapping, or generating different training
sets for each classifier by resampling the original set |17, |l^, |7^, provides another method
for correlation reduction |47|| . Breiman also addresses this issue, and discusses methods aimed
at reducing the correlation among estimators §, 0. Twomey and Smith discuss combining
and resampling in the context of a 1-d regression problem [T^ . The use of principal component
regression to handle multi-collinearity while combining outputs of multiple regressors was suggested
in |4^. Another approach to reducing the correlation of classifiers can be found in input
decimation, or in purposefully withholding some parts of each pattern from a given classifier Jto] .
Modifying the training of individual classifiers in order to obtain less correlated classifiers was also
explored, and the selection of individual classifiers through a genetic algorithm is suggested
in @.

In theory, reducing the correlation among classifiers that are combined increases the ensemble 
classification rates. In practice however, since each classifier uses a subset of the training data, 
individual classifier performance can deteriorate, thus offsetting any potential gains at the ensem- 
ble level l?^. It is therefore crucial to reduce the correlations without increasing the individual 
classifiers' error rates. 
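
As an illustration of the resampling idea, a bootstrap scheme can be sketched in a few lines. This is a generic sketch of sampling with replacement, not the procedure of any of the cited works; the helper names `bootstrap_sets` and `overlap`, the seed, and the use of integer indices as stand-ins for training patterns are assumptions.

```python
import random


def bootstrap_sets(patterns, n_classifiers, seed=0):
    """Draw one resampled training set (with replacement) per classifier."""
    rng = random.Random(seed)
    size = len(patterns)
    return [rng.choices(patterns, k=size) for _ in range(n_classifiers)]


def overlap(a, b):
    """Fraction of the distinct patterns in `a` that also occur in `b`."""
    return len(set(a) & set(b)) / len(set(a))


patterns = list(range(496))  # stand-in indices for the training patterns
sets = bootstrap_sets(patterns, n_classifiers=3)
print([len(set(s)) for s in sets])          # distinct patterns per set
print(round(overlap(sets[0], sets[1]), 2))  # shared fraction between two sets
```

Each resampled set contains only a subset of the distinct original patterns (about 63% on average for large sets), which is the source of both the reduced correlation and the possible loss in individual classifier performance noted above.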



6 Experimental Combining Results 

In order to provide in-depth analysis and to demonstrate the results on public domain data
sets, we have divided this section into two parts. First we will provide detailed experimental
results on one difficult data set, outlining all the relevant design steps/parameters. Then we will
summarize results on several public domain data sets taken from the UCI repository/Proben1
benchmarks [50].






6.1 Oceanic Data Set 



The experimental data set used in this section is derived from underwater SONAR signals. From
the original SONAR signals of four different underwater objects, two feature sets are extracted
|2^. The first one (FS1), a 25-dimensional set, consists of Gabor wavelet coefficients, temporal
descriptors and spectral measurements. The second feature set (FS2), a 24-dimensional set,
consists of reflection coefficients based on both short and long time windows, and temporal
descriptors. Each set consists of 496 training and 823 test patterns. The data is available at
URL http://www.lans.ece.utexas.edu.



6.1.1 Combining Results 

In this section we present detailed results obtained from the Oceanic data described above. Two
types of feed forward networks, namely a multi-layered perceptron (MLP) with a single hidden
layer with 50 units and a radial basis function (RBF) network with 50 kernels, are used to classify
the patterns. Table 2 provides the test set results for individual classifier/feature set pairs. The
reported error percentages are averaged over 20 runs. Tables 3 and 4 show the combining results
for each feature set. Combining consists of utilizing the outputs of multiple MLPs, RBFs or an
MLP/RBF mix, and applying the ave and os combining operations defined earlier. When
combining an odd number of classifiers, the classifier with the better performance is selected once
more than the less successful one. For example, when combining the MLP and RBF results on
FS1 for $N = 5$, three RBF networks and two MLPs are used. Table 5 shows the improvements
that are obtained if more than one feature set is available.
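
The four combining rules reported in the tables (ave, med, max, min) can be sketched as follows. This is a minimal illustration rather than the code used for the experiments: it assumes each classifier supplies one output value per class, and the `combine` function, the `COMBINERS` table and the sample outputs are hypothetical.

```python
from statistics import mean, median

# Combining rule name -> operation applied, per class, across classifiers.
COMBINERS = {"ave": mean, "med": median, "max": max, "min": min}


def combine(outputs, rule="ave"):
    """Combine classifier outputs; return (predicted class, combined outputs).

    outputs[m][i] is the ith output (class i activation) of classifier m.
    """
    op = COMBINERS[rule]
    n_classes = len(outputs[0])
    combined = [op([out[i] for out in outputs]) for i in range(n_classes)]
    predicted = max(range(n_classes), key=lambda i: combined[i])
    return predicted, combined


# Hypothetical outputs of N = 3 classifiers on one 3-class pattern.
outputs = [
    [0.60, 0.30, 0.10],
    [0.25, 0.65, 0.10],
    [0.55, 0.35, 0.10],
]
for rule in COMBINERS:
    print(rule, combine(outputs, rule)[0])
```

On this toy pattern the ave and med rules select class 0 while the max and min rules select class 1, which shows how the choice of combiner alone can change the decision.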



Table 2: Individual Classifier Performance on Test Set.

Classifier/Feature Set   Error Rate   St. dev.
FS1/MLP                      7.47       0.44
FS1/RBF                      6.79       0.41
FS2/MLP                      9.95       0.74
FS2/RBF                     10.94       0.93


The performance of the ave combiner is better than that of the os combiners, especially for 
the second feature set (FS2). While combining information from two different feature sets, the 
linear combiner performed best with the RBF classifiers, while the max combiner performed best 
with the MLP classifiers. Furthermore, using different types of classifiers does not change the 
performance of the linear combiner when qualitatively different feature sets are used. However, 
for the OS combiners, the results do improve when both different classifier types and different 
feature sets are used. 



6.1.2 Correlation Factors 



Let us now estimate the correlation factors among the different classifiers in order to determine
the compatibility of the various classifier/feature set pairs. The data presented in Section 6.1
will be used in this section. Table 6 shows the estimated average error correlations between:

Footnote: All the combining results provide improvements that are statistically significant over the individual classifiers,
or more precisely, the hypothesis that the two means are equal (t-test) is rejected for α = 0.05.

Table 3: Combining Results for FS1.

Classifier(s)  N   Ave            Med            Max            Min
                   Error      σ   Error      σ   Error      σ   Error      σ
MLP            3    7.19   0.29    7.25   0.21    7.38   0.37    7.19   0.37
               5    7.13   0.27    7.30   0.29    7.32   0.41    7.20   0.37
               7    7.11   0.23    7.27   0.29    7.27   0.37    7.35   0.30
RBF            3    6.15   0.30    6.42   0.29    6.22   0.34    6.30   0.40
               5    6.05   0.20    6.23   0.18    6.12   0.34    6.06   0.39
               7    5.97   0.22    6.25   0.20    6.03   0.35    5.92   0.31
BOTH           3    6.11   0.34    6.02   0.33    6.48   0.43    6.89   0.29
               5    6.11   0.31    5.76   0.29    6.59   0.40    6.89   0.24
               7    6.08   0.32    5.67   0.27    6.68   0.41    6.90   0.26

Table 4: Combining Results for FS2.

Classifier(s)  N   Ave            Med            Max            Min
                   Error      σ   Error      σ   Error      σ   Error      σ
MLP            3    9.32   0.35    9.47   0.47    9.64   0.47    9.39   0.34
               5    9.20   0.30    9.22   0.30    9.73   0.44    9.27   0.30
               7    9.07   0.36    9.11   0.29    9.80   0.48    9.25   0.36
RBF            3   10.55   0.45   10.49   0.42   10.59   0.57   10.74   0.34
               5   10.43   0.30   10.51   0.34   10.55   0.40   10.65   0.37
               7   10.44   0.32   10.46   0.31   10.58   0.43   10.66   0.39
BOTH           3    8.46   0.57    9.20   0.49    8.65   0.47    9.56   0.53
               5    8.17   0.41    8.97   0.54    8.71   0.36    9.50   0.45
               7    8.14   0.28    8.85   0.45    8.79   0.40    9.40   0.39

• different runs of a single classifier on a single feature set (first four rows); 

• different classifiers trained with a single feature set (fifth and sixth rows); 

• single classifier trained on two different feature sets (seventh and eighth rows). 

There is a striking similarity between these correlation results and the improvements obtained 
through combining. When different runs of a single classifier are combined using only one feature 
set, the combining improvements are very modest. These are also the cases where the classifier 
correlation coefficients are the highest. Mixing different classifiers reduces the correlation, and in 
most cases, improves the combining results. The most drastic improvements are obtained when 
two qualitatively different feature sets are used, which are also the cases with the lowest classifier 
correlations. 



6.2 Proben1 Benchmarks



In this section, examples from the Proben1 benchmark set are used to study the benefits of
combining [50]. Table 7 shows the test set error rate for both the MLP and the RBF classifiers



Footnote: Available from: ftp://ftp.ira.uka.de/pub/papers/techreports/1994/1994-21.ps.Z.






Table 5: Combining Results when Both Feature Sets are Used.

Classifier(s)  N   Ave            Med            Max            Min
                   Error      σ   Error      σ   Error      σ   Error      σ
MLP            3    5.21   0.33    6.25   0.36    4.37   0.41    4.72   0.28
               5    4.63   0.35    5.64   0.32    4.22   0.41    4.58   0.17
               7    4.20   0.40    5.29   0.28    4.13   0.34    4.51   0.20
RBF            3    3.70   0.33    5.78   0.32    4.76   0.37    3.93   0.50
               5    3.40   0.21    5.38   0.38    4.73   0.35    3.83   0.43
               7    3.42   0.21    5.15   0.31    4.70   0.36    3.76   0.33
BOTH           3    3.94   0.24    4.52   0.29    4.34   0.42    4.51   0.30
               5    3.42   0.23    4.35   0.32    4.13   0.49    4.48   0.29
               7    3.40   0.26    4.05   0.29    4.10   0.36    4.39   0.24

Table 6: Experimental Correlation Factors Between Classifier Errors.

Feature Set/Classifier Pairs   Estimated Correlation
Two runs of FS1/MLP                  0.89
Two runs of FS1/RBF                  0.79
Two runs of FS2/MLP                  0.79
Two runs of FS2/RBF                  0.77
FS1/MLP and FS1/RBF                  0.38
FS2/MLP and FS2/RBF                  0.21
FS1/MLP and FS2/MLP                 -0.06
FS1/RBF and FS2/RBF                 -0.21



on six different data sets taken from the Proben1 benchmarks.



Table 7: Performance of Individual Classifiers on the Test Set.

Data set           MLP            RBF            Proben1-A      Proben1-B
                   Error      σ   Error      σ   Error      σ   Error      σ
CANCER1             0.69   0.23    1.49   0.79    1.47   0.64    1.38   0.49
CARD1              13.87   0.76   13.98   0.95   13.64   0.85   14.05   1.03
DIABETES1          23.52   0.72   24.87   1.51   24.57   3.53   24.10   1.91
GENE1              13.47   0.44   14.62   0.42   15.05   0.89   16.67   3.75
GLASS1             32.26   0.57   31.79   3.49   39.03   8.14   32.70   5.34
SOYBEAN1            7.35   0.90    7.88   0.75    9.06   0.80   29.40   2.50


The six data sets used here are CANCER1, DIABETES1, CARD1, GENE1, GLASS1 and
SOYBEAN1. The name and number combinations correspond to a specific training/validation/test
set split. In all cases, training was stopped when the test set error reached a plateau. We report
error percentages on the test set, and the standard deviation on those values based on 20 runs.

CANCER1 is based on breast cancer data, obtained from the University of Wisconsin Hospitals,
from Dr. William H. Wolberg. This set has 9 inputs, 2 outputs and 699 patterns, of
which 350 are used for training. An MLP with one hidden layer of 10 units, and an RBF network
with 8 kernels are used with this data.

Footnote: These Proben1 results correspond to the "pivot" and "no-shortcut" architectures (A and B respectively),
reported in [Q. The large error in the Proben1 no-shortcut architecture for the SOYBEAN1 problem is not
explained.

Footnote: We are using the same notation as in the Proben1 benchmarks.

Table 8: Combining Results for CANCER1.

Classifier(s)  N   Ave            Med            Max            Min
                   Error      σ   Error      σ   Error      σ   Error      σ
MLP            3    0.60   0.13    0.63   0.17    0.66   0.21    0.66   0.21
               5    0.60   0.13    0.58   0.00    0.63   0.17    0.63   0.17
               7    0.60   0.13    0.58   0.00    0.60   0.13    0.60   0.13
RBF            3    1.29   0.48    1.12   0.53    1.90   0.52    0.95   0.42
               5    1.26   0.47    1.12   0.47    1.81   0.58    0.98   0.37
               7    1.32   0.41    1.18   0.43    1.81   0.53    0.89   0.34
BOTH           3    0.86   0.39    0.63   0.18    1.03   0.53    0.95   0.42
               5    0.72   0.25    0.72   0.25    1.38   0.43    0.83   0.29
               7    0.86   0.39    0.58   0.00    1.49   0.39    0.83   0.34

Table 9: Combining Results for CARD1.

Classifier(s)  N   Ave            Med            Max            Min
                   Error      σ   Error      σ   Error      σ   Error      σ
MLP            3   13.37   0.45   13.61   0.56   13.43   0.44   13.40   0.47
               5   13.23   0.36   13.40   0.39   13.37   0.45   13.31   0.40
               7   13.20   0.26   13.29   0.33   13.26   0.35   13.20   0.32
RBF            3   13.40   0.70   13.58   0.76   14.01   0.66   13.08   1.05
               5   13.11   0.60   13.29   0.67   13.95   0.66   12.88   0.98
               7   13.02   0.33   12.99   0.33   13.75   0.76   12.82   0.67
BOTH           3   13.75   0.69   13.69   0.70   13.49   0.62   13.66   0.70
               5   13.78   0.55   13.66   0.67   13.66   0.65   13.75   0.64
               7   13.84   0.51   13.52   0.58   13.66   0.60   13.72   0.70

The CARD1 data set consists of credit approval decisions. 51 inputs are used to
determine whether or not to approve the credit card application of a customer. There are 690
examples in this set, and 345 are used for training. The MLP has one hidden layer with 20 units,
and the RBF network has 20 kernels.

The DIABETES1 data set is based on personal data of the Pima Indians obtained from
the National Institute of Diabetes and Digestive and Kidney Diseases [Q. The binary output 
determines whether or not the subjects show signs of diabetes according to the World Health 
Organization. The input consists of 8 attributes, and there are 768 examples in this set, half of 
which are used for training. MLPs with one hidden layer with 10 units, and RBF networks with 
10 kernels are selected for this data set. 

The GENE1 data set is based on intron/exon boundary detection, or the detection of splice junctions
in DNA sequences 45, 6^. 120 inputs are used to determine whether a DNA section is a donor,
an acceptor or neither. There are 3175 examples, of which 1588 are used for training. The MLP 
architecture consists of a single hidden layer network with 20 hidden units. The RBF network 
has 10 kernels. 






Table 10: Combining Results for DIABETES1.

Classifier(s)  N   Ave            Med            Max            Min
                   Error      σ   Error      σ   Error      σ   Error      σ
MLP            3   23.15   0.60   23.20   0.53   23.15   0.67   23.15   0.67
               5   23.02   0.59   23.13   0.53   22.81   0.78   22.76   0.79
               7   22.79   0.57   23.07   0.52   22.89   0.86   22.79   0.88
RBF            3   24.69   1.15   24.77   1.28   24.82   1.07   24.77   1.09
               5   24.32   0.86   24.35   0.72   24.66   0.81   24.56   0.90
               7   24.22   0.39   24.32   0.62   24.79   0.80   24.35   0.73
BOTH           3   24.32   1.14   23.52   0.60   24.35   1.21   24.51   1.07
               5   24.53   0.97   23.49   0.59   24.51   1.16   24.66   1.02
               7   24.43   0.93   23.85   0.63   23.85   0.93   24.53   0.86


The GLASS1 data set is based on the chemical analysis of glass splinters. The 9 inputs are
used to classify 6 different types of glass. There are 214 examples in this set, and 107 of them
are used for training. MLPs with a single hidden layer of 15 units, and RBF networks with 20
kernels are selected for this data set.

The SOYBEAN1 data set consists of 19 classes of soybean, which have to be classified using
82 input features |Q. There are 683 patterns in this set, of which 342 are used for training.
MLPs with one hidden layer with 40 units, and RBF networks with 40 kernels are selected.



Table 11: Combining Results for GENE1.

Classifier(s)  N   Ave            Med            Max            Min
                   Error      σ   Error      σ   Error      σ   Error      σ
MLP            3   12.30   0.42   12.46   0.40   12.73   0.55   12.62   0.56
               5   12.23   0.40   12.40   0.40   12.67   0.41   12.33   0.57
               7   12.08   0.23   12.27   0.35   12.57   0.31   12.18   0.43
RBF            3   14.48   0.37   14.52   0.30   14.53   0.40   14.42   0.33
               5   14.35   0.33   14.43   0.29   14.38   0.24   14.36   0.35
               7   14.33   0.35   14.40   0.24   14.28   0.18   14.33   0.32
BOTH           3   12.43   0.48   12.67   0.32   12.87   0.65   12.77   0.51
               5   12.28   0.40   12.54   0.35   12.80   0.54   12.47   0.65
               7   12.17   0.36   12.69   0.35   12.70   0.46   12.25   0.66


Tables 8-13 show the performance of the ave and os combiners. From these results, we see
that improvements are modest in general. However, recall that the reduction factors obtained in
the previous sections are on the added errors, not the overall error. For the Proben1 problems,
individual classifiers are performing well (as well or better than the results reported in [50] in
most cases) and it is therefore difficult to improve the results drastically. However, even in those
cases, combining provides an advantage: although the classification rates are not dramatically
better, they are more reliable. Indeed, a lower standard deviation means the results are less
dependent on outside factors such as initial conditions and training regime. In some cases all 20
instances of the combiner provide the same result, and the standard deviation is reduced to zero.
This can be seen in both the CANCER1 and SOYBEAN1 data sets.

Table 12: Combining Results for GLASS1.

Classifier(s)  N   Ave            Med            Max            Min
                   Error      σ   Error      σ   Error      σ   Error      σ
MLP            3   32.07   0.00   32.07   0.00   32.07   0.00   32.07   0.00
               5   32.07   0.00   32.07   0.00   32.07   0.00   32.07   0.00
               7   32.07   0.00   32.07   0.00   32.07   0.00   32.07   0.00
RBF            3   29.81   2.28   30.76   2.74   30.28   2.02   29.43   2.89
               5   29.25   1.84   30.19   1.69   30.85   2.00   28.30   2.46
               7   29.06   1.51   30.00   1.88   31.89   1.78   27.55   1.83
BOTH           3   30.66   2.52   29.06   2.02   33.87   1.74   29.91   2.25
               5   32.36   1.82   28.30   1.46   33.68   1.82   29.72   1.78
               7   32.45   0.96   27.93   1.75   34.15   1.68   29.91   1.61

Table 13: Combining Results for SOYBEAN1.

Classifier(s)  N   Ave            Med            Max            Min
                   Error      σ   Error      σ   Error      σ   Error      σ
MLP            3    7.06   0.00    7.09   0.13    7.06   0.00    7.85   1.42
               5    7.06   0.00    7.06   0.00    7.06   0.00    8.38   1.63
               7    7.06   0.00    7.06   0.00    7.06   0.00    8.88   1.68
RBF            3    7.74   0.47    7.65   0.42    7.85   0.47    7.77   0.44
               5    7.62   0.23    7.68   0.30    7.77   0.30    7.65   0.42
               7    7.68   0.23    7.82   0.33    7.68   0.29    7.59   0.45
BOTH           3    7.18   0.23    7.12   0.17    7.56   0.28    7.85   1.27
               5    7.18   0.23    7.12   0.17    7.50   0.25    8.06   1.22
               7    7.18   0.24    7.18   0.23    7.50   0.25    8.09   1.05

One important observation that emerges from these experiments is that combining two different
types of classifiers does not necessarily improve upon (or in some cases, even reach) the error
rates obtained by combining multiple runs of the better classifier. This apparent inconsistency is
caused by two factors. First, as described in Section 3.2, the reduction factor is limited by the bias
reduction in most cases. If the combined bias is not lowered, the combiner will not outperform
the better classifier. Second, as discussed in Section 5.2, the correlation plays a major role in the
final reduction factor. There are no guarantees that using different types of classifiers will reduce
the correlation factors. Therefore, the combining of different types of classifiers, especially when
their respective performances are significantly different (the error rate for the RBF network on
the CANCER1 data set is over twice the error rate for MLPs) has to be treated with caution.

Determining which combiner (e.g. ave or med), or which classifier selection (e.g. multiple 
MLPs or MLPs and RBFs) will perform best in a given situation is not generally an easy task. 
However, some information can be extracted from the experimental results. The linear combiner, 
for example, appears more compatible with the MLP classifiers than with the RBF networks. 
When combining two types of network, the med combiner often performs better than other com- 
biners. One reason for this is that the outputs that will be combined come from different sources, 
and selecting the largest or smallest value can favor one type of network over another. These 
results emphasize the need for closely coupling the problem at hand with a classifier/combiner. 
There does not seem to be a single type of network or combiner that can be labeled "best" under
all circumstances.

7 Discussion



Combining the outputs of several classifiers before making the classification decision, has led to 
improved performance in many applications p7| , [8^ . This article presents a mathematical 
framework that underlines the reasons for expecting such improvements and quantifies the gains 
achieved. We show that combining classifiers in output space reduces the variance in boundary 
locations about the optimum (Bayes) boundary decision. Moreover, the added error regions 
associated with different classifiers are directly computed and given in terms of the boundary 
distribution parameters. In the absence of classifier bias, the reduction in the added error is 
directly proportional to the reduction in the variance. For linear combiners, if the errors of
individual classifiers are zero-mean i.i.d., the boundary variance is shown to be reduced by a factor of
$N$, the number of classifiers that are combined. When the classifiers are biased, and/or have
correlated outputs, the reductions are less than $N$.
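
The factor-of-$N$ claim and its degradation under correlation can be checked with a small simulation. The sketch below is an illustration under simplifying assumptions that are not part of the analysis itself: errors are zero-mean, unit-variance Gaussians sharing one common pairwise correlation $\delta \ge 0$, generated from a shared component plus an independent one. The function name, trial count and seed are arbitrary.

```python
import random
from statistics import pvariance


def simulated_variance_ratio(n, delta, trials=20000, seed=0):
    """Variance of the average of N unit-variance errors with common
    pairwise correlation delta (0 <= delta <= 1), estimated by simulation."""
    rng = random.Random(seed)
    shared_w, own_w = delta ** 0.5, (1.0 - delta) ** 0.5
    averages = []
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)  # component common to all classifiers
        errs = [shared_w * shared + own_w * rng.gauss(0.0, 1.0) for _ in range(n)]
        averages.append(sum(errs) / n)
    return pvariance(averages)


for n, delta in [(5, 0.0), (5, 0.5), (5, 1.0)]:
    predicted = (1.0 + delta * (n - 1)) / n
    print(n, delta, round(simulated_variance_ratio(n, delta), 3), round(predicted, 3))
```

The simulated variance of the average should track $(1 + \delta (N-1)) / N$: close to $1/N$ for uncorrelated errors and close to 1 (no reduction) when the errors are fully correlated.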

Order statistics combiners are discussed as an alternative to linear methods, and are motivated 
by their ability to extract the "right" amount of information. We study this family of combiners 
analytically, and we present experimental results showing that os combiners improve upon the 
performance of individual classifiers. During the derivation of the main result, the decision 
boundary is treated as a random variable without specific distribution assumptions. However, 
in order to obtain the table of reduction factors for the order statistics combiners, a specific 
error model needed to be adopted. Since there may be a multitude of factors contributing to the 
errors, we have chosen the Gaussian model. Reductions for several other noise models can be 
obtained from similar tables available in order statistics textbooks |T^. The expected error 
given in Equation ^ is in general form, and any density function can be used to reflect changes 
in the distribution function. 
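
Under the Gaussian model, the variances of the order statistics can also be approximated by simulation instead of being read from a table. The sketch below assumes that the relevant quantity for an os combiner is the variance of the $k$th order statistic of $N$ independent unit-variance Gaussian errors; the function name, trial count and seed are illustrative choices, and other noise models can be tried by replacing the `gauss` draw.

```python
import random
from statistics import pvariance


def order_statistic_variance(n, k, trials=50000, seed=0):
    """Estimate the variance of the kth smallest of N standard normal draws
    (k = 1 is the min, k = N the max, k = (N + 1) // 2 the median for odd N)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        draws = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        samples.append(draws[k - 1])
    return pvariance(samples)


n = 3
for name, k in [("min", 1), ("med", 2), ("max", n)]:
    print(name, round(order_statistic_variance(n, k), 3))
```

For $N = 3$ the estimates should come out near 0.56 for the min and max and near 0.45 for the median, each below the unit variance of a single classifier's error.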

Although most of our analysis focuses on two classes, it is readily applicable to multi-class 
problems. In general, around a boundary decision, the error is governed by the two (locally) 
dominant classes. Therefore, even in a multi-class problem, one only needs to consider the two 
classes with the highest activation values (i.e., highest posterior) in a given localized region. 

Another important feature that arises from this study provides a new look at the classic
bias/variance dilemma. Combining provides a method for decoupling the two components of
the error to a degree, allowing a reduction in the overall error. Bias in the individual classifiers 
can be reduced by using larger classifiers than required, and the increased variance due to the 
larger classifiers can be reduced during the combining stage. Studying the effects of this coupling 
between different errors and distinguishing situations that lead to the highest error reduction 
rates are the driving motivations behind this work. That goal is attained by clarifying the 
relationship between output space combining and classification performance. 

Several practical issues that relate to this analysis can now be addressed. First, let us note that
since in general each individual classifier will have some amount of bias, the actual improvements
will be less radical than those obtained in Section 3.1. It is therefore important to determine
how to keep the individual biases minimally correlated. One method is to use classifiers with
paradigms/architectures based on different principles. For example, using multi-layered perceptrons
and radial basis function networks provides both global and local information processing, and
shows less correlation than if classifiers of only one type were used. Other methods such as
resampling, cross-validation or actively promoting diversity among classifiers can also be used,
as long as they do not adversely affect the individual classification results.

The amount of training that is required before classifiers are combined is also an interesting
question. If a combiner can overcome overtraining or undertraining, new training regimes could
be used for classifiers that will be combined. We have observed that combiners do compensate for
overtraining, but not undertraining (except in cases where the undertraining is very mild). This
agrees well with the theoretical framework, which shows combining to be more effective at
variance reduction than bias reduction.

The classification rates obtained by the order statistics combiners in Section 6 are, in general,
comparable to those obtained by averaging. The advantage of OS approaches should be
more evident in situations where there is substantial variability in the performance of individual
classifiers, so that the robust properties of OS combining can be brought to bear. Such
variability in individual performance may be due to, for example, the classifiers being geographically
distributed and working only on locally available data of highly varying quality. Current
work by the authors indicates that this is indeed the case, but the issue needs to be examined in
greater detail.

One final note that needs to be considered is the behavior of combiners for a large number
of classifiers ($N$). Clearly, the errors cannot be arbitrarily reduced by increasing $N$ indefinitely.
This observation, however, does not contradict the results presented in this analysis. For large
$N$, the assumption that the errors are i.i.d. breaks down, reducing the improvements due to
each extra classifier. The number of classifiers that yield the best results depends on a number of
factors, including the number of feature sets extracted from the data, their dimensionality, and
the selection of the network architectures.